The goal of this notebook is to run the full modeling process for the adult dataset, made available for the challenge in Curso-R's Introduction to Machine Learning course, using the tidymodels framework. That is: explore, clean, prepare, tune, and pick the model that best fits the data. Let's go!
The variables appear to be in the correct formats. Points of attention are workclass, occupation, and native_country, which contain missing values.
Now let's look at how each variable behaves so we can decide how to prepare the data for the model.
# DataExplorer::create_report(adult)
devtools::source_url("https://raw.githubusercontent.com/ricardomattos05/functions/master/function_AED_bivariada.R")
adult2 <- adult %>%
  select(-id) %>%
  mutate(resposta = if_else(resposta == ">50K", 1, 0))

# names(adult2)
for (i in 1:(length(adult2) - 1)) {
  df <- adult2[, c(i, 15)]
  cat("### ", names(df[, 1]), "\n")
  print(AED_biv(df, "resposta", "Pre"))
  cat("\n\n")
}
Observations:

education: the higher the education level, the greater the proportion of people earning above 50K. The categories below HS-grad (1st-4th through 12th), besides being poorly represented, also have a low proportion, so we will consolidate them into a new category, HS-not-grad.

marital_status: here we will group Married-AF-spouse and Married-civ-spouse into a Married category, based on how similarly they behave with respect to the response and considering their descriptions.

native_country: a field with little variability, with about 90% of the records labeled "United-States". We could keep only United-States and lump the rest into "other", but to retain as much information as possible we will reduce it to 3 categories: the most representative value (United-States), the countries whose >50K proportion is above the overall mean, and those below it.

relationship: this field contains the levels husband and wife; apparently we could group them, but let's dig a little deeper first.

capital_loss and capital_gain: apparently both those who gain and those who lose some amount show a higher probability of earning >50K. Let's look at how the two relate.

workclass: poorly represented categories such as Never-worked and Without-pay have no records with the response of interest (">50K"). Let's zoom in on this variable and also examine the NAs we identified.
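The education and marital_status regroupings described above can be sketched with a plain recode. This is a hypothetical illustration with toy rows, not the notebook's code (in the notebook the regrouping ends up inside the recipe):

```r
library(dplyr)

# Toy rows standing in for the adult base (values are hypothetical):
toy <- data.frame(
  education      = c("Bachelors", "11th", "HS-grad", "5th-6th"),
  marital_status = c("Married-civ-spouse", "Never-married",
                     "Married-AF-spouse", "Divorced")
)

toy_grouped <- toy %>%
  mutate(
    # Collapse every pre-HS-grad level into HS-not-grad:
    education = if_else(
      education %in% c("Preschool", "1st-4th", "5th-6th", "7th-8th",
                       "9th", "10th", "11th", "12th"),
      "HS-not-grad", education
    ),
    # Merge the two married levels into one:
    marital_status = if_else(
      marital_status %in% c("Married-AF-spouse", "Married-civ-spouse"),
      "Married", marital_status
    )
  )
```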
ggplot(adult, aes(x = occupation, fill = resposta)) +
  geom_bar(position = "fill") +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("occupation")
Imputing the NAs with the mode would not make sense here: our goal is the highest possible predictive power, so we do not want to lose information. Rather than diluting the NAs into the most representative category, Prof-specialty, we will assign them to a category with similar proportions and good representativeness: Farming-fishing.
ggplot(adult, aes(x = relationship)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("relationship")

ggplot(adult, aes(x = relationship, fill = resposta)) +
  geom_bar(position = "fill") +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("relationship")

ggplot(adult, aes(x = relationship, fill = sex)) +
  geom_bar(position = "fill") +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("relationship")
Let's then balance gender by grouping the Wife and Husband categories into a single Married category.
ggplot(adult, aes(x = capital_gain, y = capital_loss)) +
  geom_point()
sum(adult$capital_loss > 0 & adult$capital_gain > 0)
No observation has both a gain and a loss, so we can sum them and create a capital_total variable with no fear of losing information.
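The combination can be sketched as follows (toy values are hypothetical; the notebook does this inside the recipe):

```r
library(dplyr)

# Since no row has both a gain and a loss, one column keeps both signals:
toy <- data.frame(capital_gain = c(0, 5000, 0),
                  capital_loss = c(1500, 0, 0))

toy <- toy %>% mutate(capital_total = capital_gain + capital_loss)
```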
ggplot(adult, aes(x = workclass)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("Workclass")

ggplot(adult, aes(x = workclass, fill = resposta)) +
  geom_bar(position = "fill") +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("Workclass")
Apparently the NA category relates to the response differently from all the other categories, so we will create a new category, not-identify, to hold the NA values.
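The NA handling decided for occupation and workclass can be sketched with tidyr::replace_na (hypothetical toy rows; in the notebook this lives inside the recipe's preprocessing):

```r
library(dplyr)
library(tidyr)

# Toy rows with the two columns that carry NAs (values are hypothetical):
toy <- data.frame(
  occupation = c("Prof-specialty", NA, "Sales"),
  workclass  = c("Private", "Private", NA)
)

toy_filled <- toy %>%
  mutate(
    # occupation NAs go to a similar-proportion, well-represented category:
    occupation = replace_na(occupation, "Farming-fishing"),
    # workclass NAs get their own category, since they behave differently:
    workclass  = replace_na(workclass, "not-identify")
  )
```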
med <- (adult %>%
  filter(resposta == ">50K") %>%
  count() %>%
  pull(n)) / nrow(adult)

tb_country <- adult %>%
  select(native_country, resposta) %>%
  group_by(native_country) %>%
  count(resposta) %>%
  mutate(prop = prop.table(n)) %>%
  filter(resposta == ">50K") %>%
  mutate(class = case_when(
    native_country == "United-States" ~ "United-States",
    prop > med ~ ">mean",
    prop <= med ~ "<=mean"
  ))

tb_country %>%
  select(native_country, class) %>%
  group_by(class) %>%
  count()
That leaves us with 21 countries with proportions below the mean, 18 above it, and "United-States" as the third category.
The resulting split puts about 5% of the data in above-mean countries and 5% in below-mean countries.
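To apply this grouping, each country can be mapped onto its 3-level class with a lookup join, where tb_country computed above plays the role of the lookup. The rows below are hypothetical stand-ins, not the real country classes:

```r
library(dplyr)

# Hypothetical lookup standing in for tb_country's native_country/class pairs:
lookup <- data.frame(
  native_country = c("United-States", "India", "Mexico"),
  class          = c("United-States", ">mean", "<=mean")
)

toy <- data.frame(native_country = c("Mexico", "United-States", "India"))

# Carry the 3-level class back onto the base:
toy_classed <- left_join(toy, lookup, by = "native_country")
```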
With the exploratory analysis complete, let's begin the modeling steps using the tidymodels framework.
Splitting the data into training and test sets for modeling.
The treatments identified in the EDA (done with the DataExplorer package plus the AED_biv function I wrote to examine each variable against the response) will be stored with recipes, so the same preprocessing is used both to train the models and, later, to test them.
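Those treatments can be sketched as a recipe. The step names and order match the workflow printouts further down, but the exact selectors and arguments are assumptions, and the toy data here is a hypothetical stand-in for adult_train:

```r
library(recipes)

# Tiny hypothetical stand-in for adult_train:
toy_train <- data.frame(
  resposta     = factor(c("0", "1", "0")),
  workclass    = c("Private", "not-identify", "Private"),
  capital_gain = c(0, 5000, 0),
  capital_loss = c(1500, 0, 0)
)

rec <- recipe(resposta ~ ., data = toy_train) %>%
  step_mutate(capital_total = capital_gain + capital_loss) %>%  # new feature
  step_rm(capital_gain, capital_loss) %>%                       # drop the originals
  step_string2factor(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_zv(all_predictors()) %>%                                 # drop zero-variance columns
  step_novel(all_nominal_predictors()) %>%                      # guard against unseen levels
  step_dummy(all_nominal_predictors())

baked <- prep(rec) %>% bake(new_data = NULL)
```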
Specifying the cross-validation:
set.seed(32)
adult_vfold <- vfold_cv(adult_train, v = 5, strata = resposta)
adult_vfold
# 5-fold cross-validation using stratification
The models to be fitted:
Specifying the model:
adult_tree
Decision Tree Model Specification (classification)
Main Arguments:
cost_complexity = tune()
tree_depth = tune()
min_n = tune()
Computational engine: rpart
Workflow for the decision tree:
Parameters:
hiperparams
Collection of 3 parameters for tuning
 id              parameter type   object class
 cost_complexity cost_complexity  nparam[+]
 tree_depth      tree_depth       nparam[+]
 min_n           min_n            nparam[+]
Grid:
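The tree_grid object itself was not shown. Given that the tuning log reports 10 candidates per fold, a plausible equivalent is a random grid over the three tuned hyperparameters; this is a sketch, not the notebook's actual grid:

```r
library(dials)

# Hypothetical reconstruction of tree_grid: 10 random candidates over the
# three parameters marked tune() in the spec above.
set.seed(32)
tree_grid <- grid_random(
  cost_complexity(),
  tree_depth(),
  min_n(),
  size = 10
)
```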
Tuning the hyperparameters:
tree_tune <-
  workflow_adult_tree %>%
  tune_grid(
    resamples = adult_vfold,
    grid = tree_grid,
    control = control_grid(save_pred = TRUE, verbose = TRUE, allow_par = FALSE),
    metrics = metric_set(roc_auc)
  )
(verbose tuning log omitted: recipe prep plus 10 candidate models fitted and predicted on each of the 5 folds)
Finalizing the workflow:
workflow_tree_final
== Workflow ====================================================================
Preprocessor: Recipe
Model: decision_tree()
-- Preprocessor ----------------------------------------------------------------
7 Recipe Steps
* step_mutate()
* step_rm()
* step_string2factor()
* step_normalize()
* step_zv()
* step_novel()
* step_dummy()
-- Model -----------------------------------------------------------------------
Decision Tree Model Specification (classification)
Main Arguments:
cost_complexity = 1.17576363081513e-05
tree_depth = 8
min_n = 13
Computational engine: rpart
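The finalization code itself was not shown; with tidymodels it typically looks like this (a sketch assuming the tree_tune results and workflow_adult_tree defined above):

```r
library(tune)

# Pick the candidate with the best cross-validated roc_auc and fix its
# hyperparameters into the workflow:
best_tree <- select_best(tree_tune, metric = "roc_auc")
workflow_tree_final <- finalize_workflow(workflow_adult_tree, best_tree)
```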
Checking variable importance:
Final model:
Specifying the model:
adult_rf
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = tune()
min_n = tune()
Computational engine: randomForest
Workflow for the random forest:
Grid:
Tuning the hyperparameters:
set.seed(123)
rf_tune <-
  workflow_adult_rf %>%
  tune_grid(
    resamples = adult_vfold,
    grid = rf_grid,
    control = control_grid(save_pred = TRUE, verbose = TRUE, allow_par = TRUE),
    metrics = metric_set(roc_auc)
  )
(verbose tuning log omitted: recipe prep plus 10 candidate models fitted and predicted on each of the 5 folds)
Finalizing the workflow:
workflow_rf_final
== Workflow ====================================================================
Preprocessor: Recipe
Model: rand_forest()
-- Preprocessor ----------------------------------------------------------------
7 Recipe Steps
* step_mutate()
* step_rm()
* step_string2factor()
* step_normalize()
* step_zv()
* step_novel()
* step_dummy()
-- Model -----------------------------------------------------------------------
Random Forest Model Specification (classification)
Main Arguments:
mtry = 42
trees = 1406
min_n = 21
Computational engine: randomForest
Checking variable importance:
workflow_rf_final %>%
  fit(adult_train) %>%
  pull_workflow_fit() %>%
  vip::vip(geom = "col")
Final model:
rf_final <- last_fit(workflow_rf_final, adult_split)
collect_metrics(rf_final) #roc_auc = 0.9072885
Tuning mtry, trees, min_n, tree_depth, and learn_rate:
adult_xgb
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = tune()
trees = tune()
min_n = tune()
tree_depth = tune()
learn_rate = tune()
Computational engine: xgboost
Workflow for XGBoost:
Grid:
Tuning the hyperparameters:
ini <- Sys.time()
xgb_tune <-
  workflow_adult_xgb %>%
  tune_grid(
    resamples = adult_vfold,
    grid = xgb_grid,
    control = control_grid(verbose = TRUE),
    metrics = metric_set(roc_auc)
  )
(verbose tuning log omitted: recipe prep plus 20 candidate models fitted and predicted on each of the 5 folds)
Sys.time() - ini # Time difference of 1.034167 hours (~35 mins with parallel)
Time difference of 39.9844 mins
Finalizing the workflow:
workflow_xgb_final
== Workflow ====================================================================
Preprocessor: Recipe
Model: boost_tree()
-- Preprocessor ----------------------------------------------------------------
7 Recipe Steps
* step_mutate()
* step_rm()
* step_string2factor()
* step_normalize()
* step_zv()
* step_novel()
* step_dummy()
-- Model -----------------------------------------------------------------------
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 34
trees = 1309
min_n = 5
tree_depth = 10
learn_rate = 0.0106445048353615
Computational engine: xgboost
Checking variable importance:
workflow_xgb_final %>%
  fit(adult_train) %>%
  pull_workflow_fit() %>%
  vip::vip(geom = "col")
Final model:
Tuning loss_reduction and sample_size:
adult_xgb2
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 34
trees = 1309
min_n = 5
tree_depth = 10
learn_rate = 0.0106445
loss_reduction = tune()
sample_size = tune()
Computational engine: xgboost
Workflow for XGBoost:
workflow_adult_xgb2
== Workflow ====================================================================
Preprocessor: Recipe
Model: boost_tree()
-- Preprocessor ----------------------------------------------------------------
7 Recipe Steps
* step_mutate()
* step_rm()
* step_string2factor()
* step_normalize()
* step_zv()
* step_novel()
* step_dummy()
-- Model -----------------------------------------------------------------------
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 34
trees = 1309
min_n = 5
tree_depth = 10
learn_rate = 0.0106445
loss_reduction = tune()
sample_size = tune()
Computational engine: xgboost
Grid:
Tuning the hyperparameters:
getDoParWorkers()
[1] 3
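The 3 workers reported above come from a parallel backend registered beforehand so that tuning with allow_par = TRUE runs folds in parallel. A typical setup looks like this (a sketch, not the notebook's exact code):

```r
library(doParallel)  # attaches foreach and parallel

# Start a 3-worker cluster and register it as the foreach backend:
cl <- makePSOCKcluster(3)
registerDoParallel(cl)

foreach::getDoParWorkers()  # reports the registered worker count

# ... run tune_grid() here ...

stopCluster(cl)
```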
Finalizing the workflow:
workflow_xgb_final2
== Workflow ====================================================================
Preprocessor: Recipe
Model: boost_tree()
-- Preprocessor ----------------------------------------------------------------
7 Recipe Steps
* step_mutate()
* step_rm()
* step_string2factor()
* step_normalize()
* step_zv()
* step_novel()
* step_dummy()
-- Model -----------------------------------------------------------------------
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 34
trees = 1309
min_n = 5
tree_depth = 10
learn_rate = 0.0106445
loss_reduction = 2.08764734275328
sample_size = 0.869696747139096
Computational engine: xgboost
Checking variable importance:
workflow_xgb_final2 %>%
  fit(adult_train) %>%
  pull_workflow_fit() %>%
  vip::vip(geom = "col")
Final model:
Specifying the final model:
adult_xgb6
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 33
trees = 1347
min_n = 5
tree_depth = 7
learn_rate = 0.01566693
Computational engine: xgboost
Workflow for XGBoost:
Finalizing the workflow:
workflow_xgb_final6
== Workflow ====================================================================
Preprocessor: Recipe
Model: boost_tree()
-- Preprocessor ----------------------------------------------------------------
7 Recipe Steps
* step_mutate()
* step_rm()
* step_string2factor()
* step_normalize()
* step_zv()
* step_novel()
* step_dummy()
-- Model -----------------------------------------------------------------------
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 33
trees = 1347
min_n = 5
tree_depth = 7
learn_rate = 0.01566693
Computational engine: xgboost
Checking variable importance:
workflow_xgb_final6 %>%
  fit(adult_train) %>%
  pull_workflow_fit() %>%
  vip::vip(geom = "col")
Final model:
Confusion matrix:
adult_val %>%
  transmute(
    resposta = factor(resposta, levels = c(">50K", "<=50K")),
    more_than_50k = ifelse(more_than_50k > 0.5, ">50K", "<=50K") %>%
      factor(levels = c(">50K", "<=50K"))
  ) %>%
  table() %>%
  caret::confusionMatrix()
Confusion Matrix and Statistics

        more_than_50k
resposta  >50K <=50K
   >50K   2512  1334
   <=50K   726 11709

               Accuracy : 0.8735
                 95% CI : (0.8683, 0.8785)
    No Information Rate : 0.8011
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.6291

 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.7758
            Specificity : 0.8977
         Pos Pred Value : 0.6531
         Neg Pred Value : 0.9416
             Prevalence : 0.1989
         Detection Rate : 0.1543
   Detection Prevalence : 0.2362
      Balanced Accuracy : 0.8368

       'Positive' Class : >50K